Syllabic Pitch Tuning for Neutral-to-emotional Voice Conversion
Authors
Abstract
Prosody plays an important role in neutral-to-emotional voice conversion. Prosodic features such as pitch are usually estimated and modified at the segmental level, using short analysis windows over which the speech signal is expected to be quasi-stationary. This results in frame-by-frame changes to the acoustic parameters used to synthesize emotional speech. When converting neutral speech into emotional speech from the same speaker, it may be better to tune the pitch trajectory at a supra-segmental level, such as the syllable level, because the changes in the signal are more subtle and smooth. In this paper we aim to show that pitch tuning in a neutral-to-emotional voice conversion system can produce higher-quality output speech when it is performed at the supra-segmental (syllable) level rather than at the frame level. Subjective evaluation results demonstrate the improvements in naturalness and speaker similarity.
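The abstract does not spell out the authors' tuning algorithm. As a rough illustration of the distinction it draws, the sketch below contrasts a common frame-level baseline (a global log-F0 mean/variance mapping applied to every voiced frame, the "simple frame-level mean and variance normalization" also mentioned in one of the related abstracts) with a hypothetical syllable-level variant that applies one pitch offset and one range factor per syllable. The function names, the given syllable boundaries, and the per-syllable parameters are assumptions for illustration only, not the paper's method.

```python
import math
from statistics import fmean, pstdev


def frame_level_f0(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Frame-level baseline: map each voiced frame's log-F0 with global stats.

    f0 is a per-frame F0 track in Hz, with 0.0 marking unvoiced frames; the
    mean/std arguments are log-F0 statistics over voiced frames.
    """
    out = []
    for value in f0:
        if value <= 0:
            out.append(0.0)  # unvoiced frames stay unvoiced
        else:
            z = (math.log(value) - src_mean) / src_std
            out.append(math.exp(tgt_mean + tgt_std * z))
    return out


def syllable_level_f0(f0, syllables, shifts, range_scales):
    """Hypothetical syllable-level tuning: one (shift, range) pair per syllable.

    syllables: (start, end) frame indices, end exclusive (assumed given).
    shifts: log-F0 offset per syllable, e.g. math.log(1.2) raises it by 20%.
    range_scales: factor on each frame's log-F0 deviation from the syllable
    mean, so the contour inside a syllable is stretched or flattened as a whole.
    """
    out = list(f0)  # frames outside any syllable are left unchanged
    for (start, end), shift, scale in zip(syllables, shifts, range_scales):
        voiced = [i for i in range(start, end) if f0[i] > 0]
        if not voiced:
            continue
        mean = fmean([math.log(f0[i]) for i in voiced])
        for i in voiced:
            out[i] = math.exp(mean + shift + scale * (math.log(f0[i]) - mean))
    return out


# Toy F0 track (Hz) with two hand-labelled syllables (illustrative values).
f0 = [0.0, 100.0, 110.0, 120.0, 0.0, 150.0, 140.0, 130.0, 0.0]
syllables = [(0, 4), (4, 9)]

logs = [math.log(v) for v in f0 if v > 0]
mu, sd = fmean(logs), pstdev(logs)
frame = frame_level_f0(f0, mu, sd, mu + math.log(1.2), sd * 1.5)
tuned = syllable_level_f0(f0, syllables, [math.log(1.2), math.log(1.1)], [1.5, 1.0])
```

In the frame-level mapping, every voiced frame follows the same global rule. In the syllable-level sketch, each syllable receives its own offset and range, applied uniformly across its frames, so the contour shape within a syllable is preserved up to the range factor. How such per-syllable parameters would be chosen for a target emotion is left open here.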
Similar resources
Body size projection by voice quality in emotional speech—Evidence from Mandarin Chinese
This study attempts to extend the line of research that uses body size projection theory to account for emotional speech. The theory predicts that anger is expressed by projecting a large body size with low pitch, rough voice and a long vocal tract; happiness is expressed by projecting a small body size with high pitch, breathy voice and a short vocal tract. Ten native speakers of Mandarin...
GMM-based voice conversion applied to emotional speech synthesis
A voice conversion method is applied to synthesize emotional speech from standard reading (neutral) speech. Pairs of neutral and emotional speech are used to train the conversion rules. The conversion adopts a GMM (Gaussian Mixture Model) with DFW (Dynamic Frequency Warping). We also adopt STRAIGHT, a high-quality speech analysis-synthesis algorithm. As conversion target emotions, (Hot) a...
Text-independent F0 transformation with non-parallel data for voice conversion
In voice conversion, a simple frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method which retains the simplicity and text-independenc...
Vokinesis: syllabic control points for performative singing synthesis
Performative control of voice is the process of real-time speech synthesis or modification by means of hand or foot gestures. Vokinesis, a system for real-time rhythm and pitch modification and control of singing, is presented. Pitch and vocal effort are controlled by a stylus on a graphic tablet. The concept of Syllabic Control Points (SCP) is introduced for timing and rhythm control. A ch...
Statistical Variation Analysis of Formant and Pitch Frequencies in Anger and Happiness Emotional Sentences in Farsi Language
Setting up an emotion recognition or emotional speech recognition system depends directly on how emotion changes speech features. In this research, the influence of the emotions anger and happiness was evaluated and the results were compared with neutral speech. To this end, the pitch frequency and the first three formant frequencies were used. The experimental results showed that there are lo...